METHOD AND APPARATUS FOR SPEAKER IDENTIFICATION 



FIELD OF THE INVENTION 

This invention relates to speech identification using cepstral covariance 
matrices and distance metrics. 

BACKGROUND OF THE INVENTION 

Automatic verification or identification of a person by their speech is 
attracting greater interest as an increasing number of business transactions are 
being performed over the phone, where automatic speaker identification is 
desired or required in many applications. In the past several decades, three 
techniques have been developed for speaker recognition, namely (1) Gaussian 
mixture model (GMM) methods, (2) vector quantization (VQ) methods, and (3) 
various distance measure methods. The invention is directed to the last class of 
techniques. 

The performance of current automatic speech and speaker recognition 
technology is quite sensitive to certain adverse environmental conditions, such 
as background noise, channel distortions, speaker variations, and the like. 
Handset distortion is one of the main factors that degrade the performance of 
speech and speaker recognizers. In current speech technology, the common 
way to remove handset distortion is cepstral mean normalization, which is 
based on the assumption that handset distortion is linear. 

In the art of distance metrics speech identification, it is well known that 
covariance matrices of speech feature vectors, or cepstral vectors, carry a 
wealth of information on speaker characteristics. Cepstral vectors are generally 
obtained by inputting a speech signal and dividing the signal into segments, 
typically 10 milliseconds each. A fast Fourier transform is performed on each 
segment and the energy calculated for each of N frequency bands. The 
logarithm of the energy for each band is subject to a cosine transformation, 
thereby yielding a cepstral vector having N elements. The frequency bands are 
not usually equally spaced, but rather are scaled, such as mel-scaled, for 
example, as by the equation mf = 1125 log(0.0016f + 1), where f is the frequency 
in Hertz and mf is the mel-scaled frequency. 
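The framing, band-energy, logarithm, and cosine-transform steps described above can be sketched as follows. This is a minimal illustration, not the front end of any particular recognizer. The 8 kHz sample rate, the rectangular non-overlapping mel-spaced bands, the natural logarithm in the mel formula, the small floor added before taking logs, and all function names are assumptions made for the example.

```python
import numpy as np

def mel(f_hz):
    """Mel scaling as quoted in the text: mf = 1125 * log(0.0016 * f + 1).
    The natural logarithm is an assumption."""
    return 1125.0 * np.log(0.0016 * f_hz + 1.0)

def inverse_mel(mf):
    """Inverse of mel(), used to place band edges back on the Hertz axis."""
    return (np.exp(mf / 1125.0) - 1.0) / 0.0016

def cepstral_vectors(signal, sample_rate=8000, frame_ms=10, n_bands=13):
    """Return one N-element cepstral vector per frame (rows = frames)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))

    # Power spectrum of every frame via the FFT.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    # Band edges equally spaced on the mel axis, i.e. unequally spaced in Hertz.
    edges = inverse_mel(np.linspace(0.0, mel(sample_rate / 2.0), n_bands + 1))
    energies = np.empty((n_frames, n_bands))
    for b in range(n_bands):
        in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        energies[:, b] = power[:, in_band].sum(axis=1)

    # Logarithm of each band energy (floored to avoid log(0)).
    log_energy = np.log(energies + 1e-10)

    # Cosine transformation of the log energies (DCT-II written out explicitly).
    n = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(n, n + 0.5) / n_bands)
    return log_energy @ dct.T
```

With these defaults, one second of audio yields 100 frames of 13 coefficients each. Production front ends usually use overlapping, windowed frames and triangular filters, which this sketch omits for brevity.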

Once a set of K cepstral vectors, c1, c2, ..., cK, has been obtained, one for 
each of the K frames of the speech signal, a covariance matrix may be derived 
by the equation: 

$$S = \frac{1}{K}\left[(c_1-m)^T(c_1-m) + (c_2-m)^T(c_2-m) + \cdots + (c_K-m)^T(c_K-m)\right] \tag{1}$$

where T indicates a transposed matrix, each cepstral vector is treated as a row 
vector of N elements, m is the mean vector m = (c1 + c2 + ... + cK)/K, K is the 
number of frames of the speech signal, and S is the N×N covariance matrix. 
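A minimal sketch of equation (1), assuming the cepstral vectors are stacked as the rows of a NumPy array with one row per frame. The function name is illustrative only.

```python
import numpy as np

def cepstral_covariance(c):
    """Covariance matrix of equation (1).

    c has shape (frames, N): one N-element cepstral vector per frame.
    The sum of outer products is divided by the number of frames.
    """
    c = np.asarray(c, dtype=float)
    m = c.mean(axis=0)                 # mean cepstral vector
    centered = c - m                   # each row is (c_i - m)
    return centered.T @ centered / c.shape[0]
```

The result is an N×N symmetric matrix and agrees with `np.cov(c, rowvar=False, bias=True)`.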

Let S and Σ be covariance matrices of cepstral vectors of clips of testing 
and training speech signals, respectively; that is to say, S is the matrix for the 
sample of speech that we wish to identify and Σ is the matrix for the voice 
signature of a known individual. If the sample and signature speech signals are 
identical, then S = Σ, which is to say that SΣ⁻¹ is an identity matrix, and the 
speaker is thereby identified as the known individual. Therefore, the matrix SΣ⁻¹ 
is a measure of the similarity of the two voice clips and is referred to as the 
"similarity matrix" of the two speech signals. 

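The arithmetic, A, geometric, G, and harmonic, H, means of the 
eigenvalues λi (i = 1, ..., N) of the similarity matrix are defined as follows: 

$$A(\lambda_1,\ldots,\lambda_N) = \frac{1}{N}\sum_{i=1}^{N}\lambda_i = \frac{1}{N}\,\mathrm{Tr}(S\Sigma^{-1}) \tag{2a}$$

$$G(\lambda_1,\ldots,\lambda_N) = \left(\prod_{i=1}^{N}\lambda_i\right)^{1/N} = \left(\mathrm{Det}(S\Sigma^{-1})\right)^{1/N} \tag{2b}$$

$$H(\lambda_1,\ldots,\lambda_N) = N\left(\sum_{i=1}^{N}\frac{1}{\lambda_i}\right)^{-1} = N\left(\mathrm{Tr}(\Sigma S^{-1})\right)^{-1} \tag{2c}$$

where Tr() is the trace of a matrix and Det() is the determinant of a matrix. 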
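The trace and determinant forms of the three means can be sketched as below, assuming both covariance matrices are symmetric positive definite NumPy arrays. The function name is illustrative, and the determinant of the similarity matrix is evaluated through log-determinants purely as a numerical convenience.

```python
import numpy as np

def eigen_means(S, Sigma):
    """Arithmetic, geometric and harmonic means of the eigenvalues of
    S @ inv(Sigma), computed without an eigendecomposition."""
    n = S.shape[0]
    # A = Tr(S Sigma^-1) / N
    A = np.trace(S @ np.linalg.inv(Sigma)) / n
    # G = Det(S Sigma^-1)^(1/N), using Det(S Sigma^-1) = Det(S) / Det(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    G = np.exp((logdet_s - logdet_sigma) / n)
    # H = N / Tr(Sigma S^-1)
    H = n / np.trace(Sigma @ np.linalg.inv(S))
    return A, G, H
```

For positive definite inputs the returned values satisfy A ≥ G ≥ H and match the means obtained from an explicit eigenvalue computation.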

These values can be obtained without explicit calculation of the 
eigenvalues and are therefore computationally efficient. Also, they 
satisfy the following properties: 

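$$A\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right) = \frac{1}{H(\lambda_1,\ldots,\lambda_N)} \tag{3a}$$

$$G\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right) = \frac{1}{G(\lambda_1,\ldots,\lambda_N)} \tag{3b}$$

$$H\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right) = \frac{1}{A(\lambda_1,\ldots,\lambda_N)} \tag{3c}$$

Various distance measures have been constructed based upon these 
mean values, primarily for purposes of speaker identification, the most widely 
known being: 

$$d_1(S,\Sigma) = \frac{A}{H} - 1 \tag{4a}$$

$$d_2(S,\Sigma) = \frac{A}{G} - 1 \tag{4b}$$

$$d_3(S,\Sigma) = \frac{A^2}{GH} - 1 \tag{4c}$$

$$d_4(S,\Sigma) = A - \log(G) - 1 \tag{4d}$$
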

wherein if the similarity matrix is positive definite, the mean values satisfy the 
inequality A ≥ G ≥ H, with equality if and only if λ1 = λ2 = ... = λN. Therefore, all the 
above distance measures satisfy the positivity condition. However, if we 
exchange S and Σ (or the position of sample and signature speech signals), then 
SΣ⁻¹ → ΣS⁻¹ and λi → 1/λi, and we find that d1 satisfies the symmetric property while d2, 
d3, and d4 do not. The symmetry property is a basic mathematical requirement of 
distance metrics; therefore d1 is generally in more widespread use than the 
others. 
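The symmetry argument can be checked numerically. The sketch below assumes the four known measures take the forms d1 = A/H − 1, d2 = A/G − 1, d3 = A²/(GH) − 1 and d4 = A − log(G) − 1, and models the exchange of sample and signature by inverting every eigenvalue. The eigenvalues 1, 2 and 8 are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def means_from_eigenvalues(eigs):
    """Arithmetic, geometric and harmonic means of a list of positive eigenvalues."""
    n = len(eigs)
    A = sum(eigs) / n
    G = math.exp(sum(math.log(x) for x in eigs) / n)
    H = n / sum(1.0 / x for x in eigs)
    return A, G, H

def classic_distances(A, G, H):
    """The four previously known distance measures, expressed through A, G, H."""
    return {
        "d1": A / H - 1.0,
        "d2": A / G - 1.0,
        "d3": A * A / (G * H) - 1.0,
        "d4": A - math.log(G) - 1.0,
    }

eigs = [1.0, 2.0, 8.0]                         # illustrative eigenvalues only
forward = classic_distances(*means_from_eigenvalues(eigs))
# Exchanging sample and signature replaces each eigenvalue by its reciprocal.
backward = classic_distances(*means_from_eigenvalues([1.0 / x for x in eigs]))

for name in forward:
    print(name, round(forward[name], 4), round(backward[name], 4))
```

Only d1 prints the same value in both directions. Note that for exactly two eigenvalues A·H = G², so d2 would appear symmetric by coincidence; at least three eigenvalues that are not in geometric progression are needed to expose the asymmetry.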

As stated, the cepstral mean normalization assumes linear distortion, but 
in fact the distortion is not linear. When applied to cross-handset speaker 
identification (meaning that the handset used to create the signature matrices is 
different from the one used for the sample) using the Lincoln Laboratory Handset 
Database (LLHDB), the cepstral mean normalization technique has an error rate in 
excess of about 20%. Consider that the error rate for same-handset speaker 
identification is only about 7%, and it can be seen that channel distortion caused 
by the handset is not linear. What is needed is a method to remove the 
nonlinear components of handset distortion. 



SUMMARY OF THE INVENTION 

Disclosed is a method of automated speaker identification, comprising 
receiving a sample speech input signal from a sample handset; deriving a 
cepstral covariance sample matrix from said first sample speech signal; 
calculating, with a distance metric, all distances between said sample matrix and 
one or more cepstral covariance signature matrices; determining if the smallest 
of said distances is below a predetermined threshold value; and wherein said 
distance metric is selected from 

$$d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1,$$

$$d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1,$$

$$d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

fusion derivatives thereof, and fusion derivatives thereof with 

$$d_1(S,\Sigma) = \frac{A}{H} - 1.$$

In another aspect of the invention, the method further comprises 
identifying said sample handset; identifying a training handset used to derive 
each said signature matrix; wherein for each said signature matrix, an adjusted 
sample matrix is derived by adding to said sample matrix a distortion matrix 
comprising distortion information for said training handset used to derive said 
signature matrix; and wherein for each signature matrix, an adjusted signature 
matrix is derived by adding to each said signature matrix a distortion matrix 
comprising distortion information for said sample handset. 

In another aspect of the invention, the step of identifying said sample 
handset further comprises calculating, with a distance metric, all distances 
between said sample matrix and one or more cepstral covariance handset 
matrices, wherein each said handset matrix is derived from a plurality of speech 
signals taken from different speakers through the same handset; and 
determining if the smallest of said distances is below a predetermined threshold 
value. 

In another aspect of the invention, said distance metric satisfies symmetry 
and positivity conditions. 

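In another aspect of the invention, said distance metric is selected from 

$$d_1(S,\Sigma) = \frac{A}{H} - 1,$$

$$d_5(S,\Sigma) = A + \frac{1}{H} - 2,$$

$$d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1,$$

$$d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1,$$

$$d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

and fusion derivatives thereof. 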

In another aspect of the invention, the step of identifying said training 
handset for each said signature matrix further comprises calculating, with a 
distance metric, all distances between said signature matrix and one or more 
cepstral covariance handset matrices, wherein each said handset matrix is 
derived from a plurality of speech signals taken from different speakers through 
the same handset; and determining if the smallest of said distances is below a 
predetermined threshold value. 

In another aspect of the invention, said distance metric satisfies symmetry 
and positivity conditions. 

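In another aspect of the invention, said distance metric is selected from 

$$d_1(S,\Sigma) = \frac{A}{H} - 1,$$

$$d_5(S,\Sigma) = A + \frac{1}{H} - 2,$$

$$d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1,$$

$$d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1,$$

$$d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

and fusion derivatives thereof. 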

Disclosed is a method of automated speaker identification, comprising 
receiving a sample speech input signal from a sample handset; deriving a 
cepstral covariance sample matrix from said first sample speech signal; 
calculating, with a distance metric, all distances between an adjusted sample 
matrix and one or more adjusted cepstral covariance signature matrices, each 
said signature matrix derived from training speech signals input from a training 
handset; determining if the smallest of said distances is below a predetermined 
threshold value; wherein for each said signature matrix, said adjusted sample 
matrix is derived by adding to said sample matrix a distortion matrix comprising 
distortion information for said training handset used to derive said signature 
matrix; and wherein each said adjusted signature matrix is derived by adding to 
each said signature matrix a distortion matrix comprising distortion information 
for said sample handset. 

In another aspect of the invention, said distance metric satisfies symmetry 
and positivity conditions. 

In another aspect of the invention, said distance metric is selected from 
the distance metrics recited above, and fusion derivatives thereof. 



In another aspect of the invention, said sample handset is identified by a 
method comprising calculating, with a distance metric, all distances between said 
sample matrix and one or more cepstral covariance handset matrices, wherein 
each said handset matrix is derived from a plurality of speech signals taken from 
different speakers through the same handset; and determining if the smallest of 
said distances is below a predetermined threshold value. 

In another aspect of the invention, said distance metric satisfies symmetry 
and positivity conditions. 

In another aspect of the invention, said distance metric is selected from 
the distance metrics recited above, and fusion derivatives thereof. 



In another aspect of the invention, for each said signature matrix, said 
training handset is identified by a method comprising calculating, with a distance 
metric, all distances between said signature matrix and one or more cepstral 
covariance handset matrices, wherein each said handset matrix is derived from a 
plurality of speech signals taken from different speakers through the same 
handset; and determining if the smallest of said distances is below a 
predetermined threshold value. 

In another aspect of the invention, said distance metric satisfies symmetry 
and positivity conditions. 

In another aspect of the invention, said distance metric is selected from 
the distance metrics recited above, and fusion derivatives thereof. 



Disclosed is a program storage device, readable by machine, tangibly 
embodying a program of instructions executable by the machine to perform 
method steps for automated speaker identification, said method steps 
comprising receiving a sample speech input signal from a sample handset; 
deriving a cepstral covariance sample matrix from said first sample speech 
signal; calculating, with a distance metric, all distances between said sample 
matrix and one or more cepstral covariance signature matrices; determining if 
the smallest of said distances is below a predetermined threshold value; and 
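wherein said distance metric is selected from 

$$d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1,$$

$$d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1,$$

$$d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

fusion derivatives thereof, and fusion derivatives thereof with 

$$d_1(S,\Sigma) = \frac{A}{H} - 1.$$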

Disclosed is a program storage device, readable by machine, tangibly 
embodying a program of instructions executable by the machine to perform 
method steps for automated speaker identification, said method steps 
comprising receiving a sample speech input signal from a sample handset; 
deriving a cepstral covariance sample matrix from said first sample speech 
signal; calculating, with a distance metric, all distances between an adjusted 
sample matrix and one or more adjusted cepstral covariance signature matrices, 
each said signature matrix derived from training speech signals input from a 
training handset; determining if the smallest of said distances is below a 
predetermined threshold value; wherein for each said signature matrix, said 
adjusted sample matrix is derived by adding to said sample matrix a distortion 
matrix comprising distortion information for said training handset used to derive 
said signature matrix; and wherein each said adjusted signature matrix is derived 
by adding to each said signature matrix a distortion matrix comprising distortion 
information for said sample handset. 

BRIEF DESCRIPTION OF THE DRAWINGS 



Figure 1 is a flowchart of an embodiment of the invention. 
Figure 2 is a graph of experimental data. 
Figure 3 is a graph of experimental data. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

Referring to Figure 1, the process of the invention begins at node 10, 
wherein a cepstral covariance sample matrix S is generated from a speech signal 
received from a test subject whom the practitioner of the invention wishes to 
identify. The derivation may be by any one of a number of known methods such 
as those described in A. Cohen et al., On text-independent speaker identification 
using automatic acoustic segmentation, ICASSP, pp. 293-296, 1985; and S. B. 
Davis et al., Comparison of parametric representations for monosyllabic word 
recognition in continuously spoken sentences, IEEE, 1980, the disclosures of 
both of which are incorporated by reference herein in their entirety. 

At node 20, the optional step of determining the test subject's handset is 
executed. In real-world applications, it is most probable that the handset used to 
create the signature matrices is different from the one used to receive the test 
subject's voice for the sample matrix. The distortion added to the matrices by 
the different handsets blurs the similarity of the voices and results in higher 
misidentification rates. 

The method of determining the test subject's handset is by calculating the 
distances between the sample matrix and a database of handset matrices, each 
representative of a particular make and model of handset. The shortest distance 
determines the identity of the test subject's handset. Such a method is 
described in commonly assigned copending US patent application Wang et al., 
METHOD AND APPARATUS FOR HANDSET IDENTIFICATION, filed 
________, US Serial No. ________, attorney docket No. 
YOR9-2001-0450 (8728-530), the disclosures of which are incorporated by 
reference herein in their entirety. 

The generation of handset matrices is performed for a particular handset 
by having a substantial number of different speakers provide speech samples 
through the handset, preferably at least ten such samples, more preferably at 
least twenty. A cepstral covariance matrix is then generated for the handset 
from all the samples, thereby creating a handset matrix M. Because all the 
speakers are different, the speaker characteristics of the covariance matrix are 
smeared away, leaving the handset information. 
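A minimal sketch of handset-matrix generation and nearest-handset detection follows. It assumes that the samples from different speakers are pooled by concatenating their cepstral frames before taking one covariance (the text does not say how the samples are combined), and it uses the d1 = A/H − 1 measure for the comparison. All function names and the dictionary layout are illustrative.

```python
import numpy as np

def covariance(frames):
    """Covariance (divided by frame count) of a (frames x N) array of cepstral vectors."""
    frames = np.asarray(frames, dtype=float)
    centered = frames - frames.mean(axis=0)
    return centered.T @ centered / frames.shape[0]

def handset_matrix(utterances):
    """Handset matrix M from many speakers' utterances recorded on one handset.

    Assumption: frames from all speakers are pooled into one array.
    """
    return covariance(np.vstack(utterances))

def d1(S, Sigma):
    """d1 = A/H - 1, computed from traces only."""
    n = S.shape[0]
    A = np.trace(S @ np.linalg.inv(Sigma)) / n
    H = n / np.trace(Sigma @ np.linalg.inv(S))
    return A / H - 1.0

def detect_handset(sample_matrix, handset_matrices):
    """Return the label of the handset matrix closest to the sample matrix."""
    return min(handset_matrices,
               key=lambda label: d1(sample_matrix, handset_matrices[label]))
```

Here `handset_matrices` is a plain dictionary mapping a handset label to its matrix M, so adding a new make or model is a single dictionary assignment.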

In a preferred embodiment, a database of handset matrices will be kept 
and updated periodically as new makes and models of handsets come on the 
market. 

Flow now moves to optional node 30, where a database of signature 
matrices is accessed and the first signature matrix Σ is retrieved. 

Control now flows to node 40, where the optional adjustment of the sample 
and signature matrices is performed. This part of the process corrects the 
cross-handset distortion caused by the test subject using a different handset 
than that which was used to generate the signature matrix, namely the training 
handset. This is done by first adding to the sample matrix S a distortion matrix 
Dh corresponding to the training handset. Of course, if it has been determined 
that the test subject is using the same handset as the training handset, then this 
step may be skipped altogether, though executing it anyway will do no harm. 
Preferably, information identifying the training handset will be stored with each 
signature matrix for rapid identification. A slower method would be to test the 
signature matrix in the same manner the sample matrix was tested at node 20. 

To generate the distortion matrix, the handset matrix of the training 
handset is multiplied by a scaling factor λ, such that: 

$$D_h = \lambda M_h \tag{5}$$

where Dh is the distortion matrix for handset h and Mh is the handset matrix taken 
over all the speech samples for handset h. The scaling factor λ will be chosen so 
as to provide the greatest accuracy of speaker identification. This is determined 
experimentally, but may be expected to be approximately 0.4 or 
0.8, as can be seen in Figure 2. The graph in Figure 2 was generated from the 
example described below. 
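Choosing the scaling factor experimentally amounts to sweeping candidate values and keeping the one with the lowest measured error rate. The sketch below assumes a caller-supplied `error_rate` function that runs the identification experiment for a given scaling factor. That function, the candidate grid, and the quadratic toy curve used in the usage lines are hypothetical stand-ins, not data from the disclosure.

```python
def sweep_scaling_factor(candidates, error_rate):
    """Evaluate error_rate(lam) for each candidate and return the best one.

    Returns (best_lam, best_error, curve), where curve is the list of
    (lam, error) pairs that could be plotted as an error-versus-factor graph.
    """
    curve = [(lam, error_rate(lam)) for lam in candidates]
    best_lam, best_error = min(curve, key=lambda pair: pair[1])
    return best_lam, best_error, curve

# Usage with a made-up error curve (NOT experimental data).
def toy_error(lam):
    return (lam - 0.4) ** 2 + 0.06

candidates = [i / 10 for i in range(11)]       # 0.0, 0.1, ..., 1.0
best_lam, best_error, curve = sweep_scaling_factor(candidates, toy_error)
```

In a real experiment `error_rate` would recompute the adjusted matrices and the identification error on held-out data for each candidate value.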

By adding the sample and distortion matrices, there is generated an 
adjusted sample matrix S' that now contains information about the distortion 
caused by the training handset. Note that S' already has information about the 
distortion of the handset used by the test subject because that was the handset 
used to generate S. 

Note, however, that if this is a cross-handset situation, then the signature 
matrix must also be adjusted. Therefore, the signature matrix is added to the 
distortion matrix corresponding to the handset detected at node 20, namely the 
handset being used by the test subject. Now both the adjusted sample matrix S' 
and the adjusted signature matrix Σ' have distortion information for both 
handsets in addition to voice information. 
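The two-sided adjustment can be sketched as follows, assuming the sample, signature, and handset matrices are NumPy arrays of equal shape and that the distortion matrix is the handset matrix multiplied by a scaling factor. The function names and the default factor of 0.4 are illustrative.

```python
import numpy as np

def distortion_matrix(handset_matrix, lam=0.4):
    """Distortion matrix: the handset matrix scaled by the factor lam."""
    return lam * np.asarray(handset_matrix, dtype=float)

def adjust_pair(S, Sigma, M_sample, M_train, lam=0.4):
    """Cross-handset adjustment of a sample/signature pair.

    The sample matrix receives the TRAINING handset's distortion and the
    signature matrix receives the SAMPLE handset's distortion, so that both
    adjusted matrices carry information about both handsets.
    """
    S_adj = S + distortion_matrix(M_train, lam)
    Sigma_adj = Sigma + distortion_matrix(M_sample, lam)
    return S_adj, Sigma_adj
```

As an idealized check (an assumption for illustration, not a claim of the text): if the same voice covariance V were recorded as S = V + 0.4·Ma through one handset and Σ = V + 0.4·Mb through another, the adjusted matrices returned by `adjust_pair(S, Sigma, Ma, Mb)` would be identical, so any symmetric, positive distance between them would be zero.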

Control now flows to node 50, where the distance between the adjusted 
sample matrix S' and the adjusted signature matrix Σ' is calculated. Because we have 
adjusted for cross-handset situations, the distance will be a function of the 
difference in voice information rather than handset information. As stated above, 
there are four well known distance formulae in use, as described in H. Gish, 
Robust discrimination in automatic speaker identification, Proceedings ICASSP 
1990, vol. 1, pp. 289-292; F. Bimbot et al., Second-order statistical measures for 
text-independent speaker identification, ESCA workshop on automatic speaker 
recognition, identification and verification, 1994, pp. 51-54; and S. Johnson, 
Speaker tracking, MPhil thesis, University of Cambridge, 1997, and references 
therein; the disclosures of all of which are incorporated by reference herein in 
their entirety. Of those, the first, d1, is the most favored for its symmetry and 
positivity. To this collection may be added five new inventive distance measures: 



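$$d_5(S,\Sigma) = A + \frac{1}{H} - 2 \tag{6a}$$

$$d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4 \tag{6b}$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1 \tag{6c}$$

$$d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1 \tag{6d}$$

$$d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2 \tag{6e}$$

all of which satisfy the positivity and symmetry conditions. Along with d1, these 
distance metrics may be fused in any combination as described in K. R. Farrell, 
Discriminatory measures for speaker recognition, Proceedings of Neural 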
Networks for Signal Processing, 1995, and references therein, the disclosures of 
which are incorporated by reference herein in their entirety. The example at the 
end of this disclosure demonstrates how fusion is accomplished. 
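The positivity and symmetry of the new measures can be checked numerically. The sketch assumes the five measures read d5 = A + 1/H − 2, d6 = (A + 1/H)(G + 1/G) − 4, d7 = (A/2H)(G + 1/G) − 1, d8 = (A + 1/H)/(G + 1/G) − 1 and d9 = A/G + G/H − 2, which is a reading of the source equations rather than a verified transcription. Exchanging sample and signature is modeled by inverting each eigenvalue of the similarity matrix, and the function names are illustrative.

```python
import math

def means_from_eigenvalues(eigs):
    """Arithmetic, geometric and harmonic means of positive eigenvalues."""
    n = len(eigs)
    A = sum(eigs) / n
    G = math.exp(sum(math.log(x) for x in eigs) / n)
    H = n / sum(1.0 / x for x in eigs)
    return A, G, H

def new_distances(A, G, H):
    """The five measures d5..d9 under the reading stated in the lead-in."""
    return {
        "d5": A + 1.0 / H - 2.0,
        "d6": (A + 1.0 / H) * (G + 1.0 / G) - 4.0,
        "d7": (A / (2.0 * H)) * (G + 1.0 / G) - 1.0,
        "d8": (A + 1.0 / H) / (G + 1.0 / G) - 1.0,
        "d9": A / G + G / H - 2.0,
    }

eigs = [1.0, 2.0, 8.0]                          # illustrative eigenvalues only
forward = new_distances(*means_from_eigenvalues(eigs))
backward = new_distances(*means_from_eigenvalues([1.0 / x for x in eigs]))
identical = new_distances(*means_from_eigenvalues([1.0, 1.0, 1.0]))
```

Under this reading every entry of `forward` equals the matching entry of `backward`, every value is non-negative, and all five measures vanish when the similarity matrix is the identity.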

Control now flows through nodes 60 and 30 in a loop until the distances 
between the adjusted sample matrix S' and every adjusted signature matrix Σ' have 
been calculated. 

After all the signature matrices have been run through and distances 
calculated for all of them, control flows to node 70 where the smallest distance is 
examined to determine if it remains below a predetermined threshold value. If 
not, control flows to termination node 80, indicating that the sampled voice failed 
to match any of those in the signature database. If, however, the distance is 
below the chosen threshold, then control flows to termination node 90, indicating 
that a positive identification has been made. 
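The loop over signature matrices and the final threshold test can be sketched as follows. The example uses d1 = A/H − 1 as the default distance and a dictionary of signature matrices keyed by speaker name. Both choices, the function names, and the return convention are assumptions for illustration; any symmetric, positive distance function could be passed instead, and the cross-handset adjustment is omitted here.

```python
import numpy as np

def d1(S, Sigma):
    """d1 = A/H - 1, computed from traces only."""
    n = S.shape[0]
    A = np.trace(S @ np.linalg.inv(Sigma)) / n
    H = n / np.trace(Sigma @ np.linalg.inv(S))
    return A / H - 1.0

def identify(sample, signatures, threshold, distance=d1):
    """Return (speaker, distance) for the closest signature matrix,
    or (None, distance) when even the closest one is not below the threshold."""
    best_name, best_dist = None, float("inf")
    for name, signature in signatures.items():   # loop over the signature database
        dist = distance(sample, signature)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist < threshold:                     # positive identification
        return best_name, best_dist
    return None, best_dist                        # no match in the database
```

The threshold is application dependent; a lower value trades more rejected callers for fewer false identifications.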

The method of the invention may be embodied in a software program on a 
computer-readable medium and configured so that the identification process initiates 
as soon as a call comes in and the person on the line has spoken his first words. 

Example 

An LLHDB (Lincoln Laboratory Handset Database) corpus of recorded 
utterances was used, such as is described in D. A. Reynolds, HTIMIT and 
LLHDB: speech corpora for the study of handset transducer effects, ICASSP, 
pp. 1535-1538, May 1997, Munich, Germany, the disclosures of which are 
incorporated by reference herein in their entirety. Twenty-eight female and 24 
male speakers were asked to speak ten sentences extracted from the TIMIT 
corpus and the rainbow passage (from the LLHDB corpus) over nine handsets 
and a Sennheiser high-quality microphone. The average length of the spoken 
rainbow passages was 61 seconds. In this experiment, the rainbow passage 
was used for generating signature matrices, and the remaining utterances for 
sample matrices. One handset chosen at random was designated "cb1" and 
another "cb2". These are the actual handsets used for same-handset and 
cross-handset testing. 

Thirteen static mel-cepstra and 13 delta mel-cepstra were calculated from a 
five-frame interval. For each utterance, one full covariance matrix was 
calculated. The results are shown in Table I: 

TABLE I 

As can be seen from the data in Table I, inclusion of delta cepstra in the 
cepstral vectors greatly improves accuracy. It can also be seen that some of the 
novel distance measures yield results comparable to that of the widely used d1. 
A data fusion of d1 and d7 was performed to yield a new distance metric: 

$$d_{1\text{-}7} = (1 - a)\,d_1 + a\,d_7 \tag{7}$$

where a is referred to as the fusion coefficient and 0 ≤ a ≤ 1. The linear 
combination of two symmetric and positive distance metrics yields a fused metric 
that is also symmetric and positive. 
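The fusion of two distance values is a convex combination and can be sketched in a few lines. The function name and the range check are illustrative choices; the two arguments are distances already computed between the same pair of matrices with two different metrics.

```python
def fuse(d_first, d_second, a):
    """Fused distance (1 - a) * d_first + a * d_second with 0 <= a <= 1.

    With a = 0 the result is d_first alone; with a = 1 it is d_second alone.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("fusion coefficient a must lie in [0, 1]")
    return (1.0 - a) * d_first + a * d_second
```

Because both weights are non-negative, the fused value is non-negative whenever both inputs are, and it is unchanged by exchanging sample and signature whenever both inputs are.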

The distances between the handset matrices and the sample matrices 
were calculated for each distance formula and the error rates of handset 
detection derived, resulting in the data of Table II as follows: 

TABLE II 

The error rate of handset detection was graphed using distance metric d1, 
as can be seen in Figure 2. From the graph of Figure 2, it was decided to set 
the distortion scaling factor λ = 0.4 for this experiment. It should be noted that the 
setting of an optimal distortion scaling factor is optional because, as can be seen 
from Figure 2, setting λ = 1.0 does not introduce that much more error. 
The distortion scaling factor is largely independent of the distance metric used to 
graph it, but rather is dependent upon handset characteristics. Therefore, if 
highly accurate results are desired, a different distortion scaling factor graph 
should be generated for every different handset pair. 

Distortion matrices were generated for each handset and the distances 
calculated to the speech samples. 

Figure 3 shows the error rate of speaker identification as a function of the 
fusion coefficient a. The experiment demonstrates that when N = 26 and a = 0.25, 
the error rate obtained is only 6.17% for the fused metric, as opposed to 6.94% 
for the d1 metric alone, an 11% improvement. 
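The 11% figure is a relative reduction in error rate rather than a difference in percentage points, as the following arithmetic on the two quoted error rates shows:

$$\frac{6.94\% - 6.17\%}{6.94\%} = \frac{0.77}{6.94} \approx 0.111$$

The absolute reduction is 0.77 percentage points.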

It can therefore be seen that the invention provides good speaker 
identification performance with a variety of choices of symmetrical and positive 
distance metrics. It can be seen that the addition of delta cepstra to the 
cepstral vectors can decrease the error rate by as much as 38% and the use of 
data fusion with novel distance metrics further decreases error rates by about 
11%. A further reduction in error rates of about 17% may be obtained through a 
novel method of cross-handset adjustment. 

It is to be understood that all physical quantities disclosed herein, unless 
explicitly indicated otherwise, are not to be construed as exactly equal to the 
quantity disclosed, but rather about equal to the quantity disclosed. Further, the 
mere absence of a qualifier such as "about" or the like, is not to be construed as 
an explicit indication that any such disclosed physical quantity is an exact 
quantity, irrespective of whether such qualifiers are used with respect to any 
other physical quantities disclosed herein. 

While preferred embodiments have been shown and described, various 
modifications and substitutions may be made thereto without departing from the 
spirit and scope of the invention. Accordingly, it is to be understood that the 
present invention has been described by way of illustration only, and such 
illustrations and embodiments as have been disclosed herein are not to be 
construed as limiting to the claims. 